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Abstract 

Motivation: As costs of genome sequencing have dropped precipitously, development of effi¬ 
cient bioinformatic methods to analyze genome structure and evolution have become ever more 
urgent. For example, most published phylogenomic studies involve either massive concatenation 
of sequences, or informal comparisons of phylogenies inferred on a small subset of orthologous 
genes, neither of which provides a comprehensive overview of evolution or systematic identi¬ 
fication of genes with unusual and interesting evolution (e.g. horizontal gene transfers, gene 
duplication and subsequent neofunctionalization). We are interested in identifying such “outly¬ 
ing” gene trees from the set of gene trees and estimating the distribution of the tree over the 
“tree space”. 

Results: This paper describes an improvement to the KDETREES algorithm, an adaptation of 
classical kernel density estimation to the metric space of phylogenetic trees (Billera-Holmes- 
Vogtman treespace), whereby the kernel normalizing constants, are estimated through the use 
of the novel holonomic gradient methods. As the original kdetrees paper, we have applied 
kdetrees to a set of Apicomplexa genes and it identified several unreliable sequence alignments 
which had escaped previous detection, as well as a gene independently reported as a possible 
case of horizontal gene transfer. 

Availability: The updated version of the kdetrees software package is available both from 
CRAN (the official R package system), as well as from the official development repository on 
Github. (github.com/grady/kdetrees). 

Contact: ruriko.yoshida@uky.edu 


1 Introduction 


One of the great opportunities offered by modern genomics is that phylogenetics applied on a 
genomic scale (phylogenomics) should be especially powerful for elucidating gene and genome evo¬ 
lution, relationships among species and populations, and processes of speciation and molecular 
evolution. However, a well-recognized hurdle is the sheer volume of genomic data that can now 
be generated relatively cheaply and quickly, but for which analytical tools are lagging. There is a 
major need to explore new approaches to undertake comparative genomic and phylogenomic stud¬ 
ies much more rapidly and robustly than existing tools allow. Here, we focus on the problem of 
efficiently identifying significant discordance among a set of gene trees, as well as estimating the 
distribution of gene trees from the given set of trees. 

The kdetrees algorithm introduced in Weyenberg et al. [201.4] is an adaptation of classical 
kernel density estimation to the metric space of phylogenetic trees defined by Billera et al. 2001 . It 


is a computationally efficient method of estimating the density of the trees over the Billera-Holmes- 
Vogtman (BHV) treespace, and relies on a fast implementation of the BHV geodesic distance 
function provided by Owen and Provan 12011 . The method then uses the density estimates to 


identify putative outlier observations. This paper describes an improvement to kdetrees, whereby 
the kernel normalizing constants, are estimated through the use of the novel holonomic gradient 
methods [Koyama et al. 


2014 Marumo et al, 2014 
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In our original paper describing the kdetrees method, we propose a nonparametric estimator 
of the form, 

1 N 

fWcx-Y'kiT^hi). 

Z— 1 

In the kdetrees software, the kernel function implemented is a spherically symmetric gaussian 
kernel, i.e. 

k(T,r,h) ocexp^^^yj . ( 1 ) 

Since we are, for the moment, interested primarily in using the estimator / for outlier detection, 
knowledge of the overall proportionality constant for / is not of significant importance. However, 
it is important to know how the normalizing constant associated with k(T,T',h ) varies with the 
selected bandwidth and with the location of the kernel’s center. In our original paper we argued 
that, in practice, estimates of the normalizing constant do not appear to have significant systematic 
variation, and that assuming a constant value was a reasonable first approximation. This paper 
presents basic results and techniques for obtaining better approximate values for these normalizing 
constants. 

In the case case of Euclidean A:-space with the usual metric, the kernel ([!]) corresponds to a 
spherically symmetric multivariate normal distribution centered on the point T ', and the kernel 
normalization constant is given by 

c(T',hi) = (2irhi )- k/2 . (2) 

Note that not only is there a simple closed form solution for the constant, but the constant is 
invariant under changes to the central point T'. However, when applied to the BHV treespace with 
the geodesic metric, not only is such a closed form solution apparently unavailable, but it is also 
clear that the constant will depend on the location of the central point. 

For example, consider the case where T' = 0, i.e., the star tree, located at the origin of BHV 
space. In this case, the kernel integral, 

c(T',h) = J k(T,T,h)dT , (3) 

is symmetric over each orthant comprising BHV treespace. Within each orthant, the integral 
is equivalent to the normalizing constant of a zero-mean multivariate normal truncated to K+. 
Thus, expression ([3]) is equivalent to the number of orthants in the space, no, multiplied by the 
corresponding truncated normal constant. 

On the other extreme, consider a central tree T' such that every edge is large compared to 
the bandwidth h, i.e. the tree is relatively far away from any orthant boundary. In this case, the 
kernel integral will be very close to the value given in expression ([2]). If the central point is placed 
arbitrarily far away from any orthant boundary, then the integral over any orthant other than the 
one containing T l can be made arbitrarily small. Thus, the integral over the orthant containing T' 
itself will be an increasingly good estimate of the entire normalizing constant as the central point 
is moved further away from orthant boundaries. 

The updated version of kdetrees presented in this paper improves on the first generation 
algorithm by estimating the kernel normalizing constants c(T ', h ). This is accomplished by finding 
bounding functions in each orthant which can be more easily integrated than the true kernel 
function. While some analytic simplification is possible, certain expressions cannot be evaluated 
other than through numerical methods. 
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1.1 Holonomic Gradient Method 


The holonomic gradient method (HGM) is a non-stochastic numerical method for calculating certain 
types of integrals. The HGM is a variation on the gradient descent method of function optimization, 
and is suitable for application to holonomic functions |Nakayama et al. , 2011| Koyama et al 


2014 


Roughly, a holonomic function is a solution to a homogenous ordinary differential equation with 
polynomial coefficients Zeilberger, 1990 . Several integrals of interest to statisticians turn out to 


be expressible as solutions to an optimization problem within a holonomic system. 

For our present problem two cases are of particular use. Marumo et al. 2014 demonstrates the 
use of HGM to calculate the normalizing constant for a multivariate normal distribution truncated 
to the positive orthant, i.e., M+. In addition, Hayakawa and Takemura 2014 provides the constants 
for the so-called exponential-polynomial family of probability densities, 


f(x |0i, ...,6 k ) oc exp (xOi + ... + x k 9 k S j . 

As was briefly discussed in section [lj BHV treespace is a simplical complex of positive Eu¬ 
clidean orthants, and the normalizing constant for a truncated multivariate normal distribution is 
an ingredient for a scheme to approximate the kernel constants in BHV treespace. In section [3] we 
show that it is possible to use the normalizing constants for the truncated multivariate normal and 
the exponential-polynomial family, computed either by HGM or some other method, to construct 
approximations to the kernel normalizing constants for BHV space. 


2 Methods 


2.1 Normalizing Constants 

In this paper we use k to denote the unnormalized kernel function, i.e., with unit constant of 
proportionality in ([!]). If we are given a fixed tree To and bandwidth h, our objective is to compute 
bounds for the integral K(Tq, h) = f k(T , To, h) dT over the entire BHV treespace, so that we may 
normalize the kernel function. 

One suitable lower bound function is based on the use of the triangle inequality. 

Lemma 1. For any pair of trees, k(T, T', h ) > k(T, T', h). Where, 


k(T, T ', h) = exp ^ 


(d(T, 0) + d(0, T')) 2 \ 

h 2 J ' 


Proof. This is an immediate consequence of the fact that the geodesic path between any two 
trees is the shortest path connecting the trees. In particular, it is shorter than the cone path, 
d(T, T') < d(T, 0) + d(0, T'). □ 

However, k does better than simply providing a global lower bound for k. In fact, the bound is 
sharp, as k is equivalent to k whenever the geodesic between T and T' passes through the origin. 
This turns out to be a quite common occurrence, as geodesics between trees which are not separated 
by a small number of NNI interchanges are likely to pass through the origin. As a result for much 
of the space, integrating over k will be equivalent to integrating over k itself. Happily, integrating 
k over a single orthant affords an opportunity for analytical simplification. 
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Theorem 2.1. Let O be an arbitrary fixed orthant in BHV treespace, and let p denote its dimen¬ 
sion. Then, the integral of k(T,T', h) over that orthant is given by the expression 


C 0 (T',h ) := / k{T,T',h ) dT 
Jo 

7r p/2 e -d(0,T') a /fc a 


2P- 1 r(p/2) 

Where, if we let 0\ = — 2d(0, T’)/h 2 and 62 = —h~ 2 , then 


A(T\h). 


A(T', h) = I r p 1 exp {0\r + 0 2 ?’ 2 ) dr. 
do 


(4) 


(5) 


Proof. The distance d(T, 0) is the usual / 2 -norm of the vector of edge lengths for the tree T, and O 
is the positive orthant M+. Thus, if we express the integral over O in an angular coordinate system, 


(ri(T,0) 2 +2d(0,T')rf(T,0)) 

e h 1 dT 


d(o ,t'y r 

C 0 (T',h) = e-^~ 

Jc 

-d(o.T') 2 r°° r „ , „ , 

= e / / e dlr+02r dV (r, 0) 

Jo Je 

Now the volume element in MP in an angular coordinate system is 


( 6 ) 


v -1 

dV(r, 0) = r p ~ 1 dr sin p ^ k ^ 1 (6k) dOk, 

k =1 


and it so happens that one of the definitions of the Beta function yields, 

/2 1 

sm p - k -\e) d6 = -B((p - k)/ 2,1/2). 

Integrating in all the radial coordinates yields a product of beta functions which telescopes down to 
the constant appearing in Q. This reduces the problem to a single integral in the radial coordinate, 
which is equivalent to A(T ', h ). □ 



Unfortunately, the function A(T ', h) has no general closed form solution. However, there are 
several methods that we can use to obtain a numerical estimate of this value. The HGM method 
developed in Hayakawa and Takemura 2014 is one such method for obtaining this value. It is also 
reasonable to calculate this particular integral using classical quadrature methods. 

Lemma 2. Following the notation of Hayakawa and Takemuraj 2014 , 


A(T',h) = d p - 1 A 2 (9 1 ,0 2 ). 


Here, A 2 (9i, 0 2 ) is the normalizing constant for the exponential-polynomial distribution of order 2, 
the 6 are defined as in Theorem 2.1 and d™ means the m-th partial derivative with respect to 0\. 


Furthermore, Hayakawa gives the following equivalence for the first partial derivative 


diA 2 (6i,6 2 ) = — —— {1 + 0 i.A2 (0 i, 6 * 2 )} , 

2172 

and for the the higher derivatives, m > 2, the partials can be expressed recursively in terms of 
lower order derivatives, 

^2(01,02) = 1)^- 2 4 2 (0i,0 2 )+ 

ZC72 

0ic> 1 m - 1 4 2 (0i,02)}. (7) 
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Proof. See Hayakawa and Takemura 12014 , Section 2, equations (4) and (7). The latter expression 
can be easily obtained by differentiation of the expression for the first partial. □ 


These results are sufficient to use the hgm package described in Koyama et al. 2014 to imple¬ 
ment the lower bound for the orthant integral, C 0 (T 7 , h). 

The desired normalization constant for function 0 can be decomposed as the sum of integrals 
over each orthant in BHV space. Thus, if no is the number of orthants in the BHV space, then 
n 0 ■ C_o{T', h ) is a crude lower bound for the overall normalizing constant. Although this is a poor 
bound, it can be improved by obtaining better bounds for the contribution from various orthants 
and adjusting accordingly. 

The most obvious orthant to begin with is the orthant containing the “central” tree T ', which 
we shall call Ot'- This is the orthant where the difference between k and k will be the greatest, 
and thus the largest improvement to the bounding constant is to be found here. Note that in this 
case, the integral over Ot> is given by, 


Co T , (T', h) = / exp (d{Tf Tf/h 2 ) dT 

J Orpl 

= / exp (—||:r — XT'W/h 2 ) dx. 


( 8 ) 


This is simply the integral of a radially-symmetric multivariate gaussian kernel centered at the 
point T over the positive orthant. Such a normalizing constant can also be calculated using HGM, 


and an implementation is included in the hgm R package Koyama et al., 2014 


Further improvements to the integral for orthants which adjoin directly to Ot 1 can be made by 
noting that a relationship similar to that of Lemma [l] will hold, but with the third point in the 
triangle inequality being somewhere on the orthant boundary, instead of the origin. However, in 
practice the improvements to the bounds obtained in this way are quite small, given the typical 
values of the bandwidths which occur in practice and the small number of orthants to which 
the calculations apply. For this reason, and in the interests of controlling the overall numerical 
complexity of the kdetrees algorithm, these “second-order” approximations are not implemented 
at at this time, but may appear in future updates. 


2.2 Outlier Test 

we chose to implement a outlier test of the form, /(Tj) < c*, where the 
critical value c* is selected using Tukey’s quartile method, 

c* = Q 1 -k(Q 3 -Q 1 ). 

Here, f(T) denotes our density estimate for tree T, and the quartiles, Qi,(j 3 ,are calculated using 
all observed tree density estimates. 

Further experimentation with the method has suggested that better performance is obtained if 
the tree density scores are transformed to the log-scale before the classification step takes place, 
log /(Tj) < c*. This transformation was chosen because the raw scores, /(Tj), are themselves 
bounded below by zero, and the log transformation removes this bound. The quantiles used to 
compute the critical value are also obtained using the log-transformed scores. Due to the better 
performance characteristics of this method, the default classifier algorithm for kdetrees has been 
changed to operate on the log-density scale. 


In 


Weyenberg et al. 


2014 
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2.3 Leaf Edges 


The dimension of the orthants comprising tree space is determined by the number of taxa in the 
trees, with each edge in the fully-resolved tree contributing a dimension to each of the orthants 
comprising the space. However, the n leaf edges are represented in the space in such a way that the 
space can be decomposed into a Cartesian product S x M+ . The portion of the space S is associated 
with the internal edges of the trees, while the positive euclidean orthants M+ are associated with 
the leaf edges Billera et al, 2001 . Because of this decomposition, there is not a large amount of 


topological information contained in the portion of the space corresponding to the leaf edge lengths. 

If we remove the leaf edges from the calculation, the dimension of treespace can be reduced by 
approximately half, while retaining the important topological information. This has the benefit of 
simplifying the overall density estimation problem, as well as the computation of the normalizing 
constant estimates by the HGM methods. While the original kdetrees algorithm included the 
leaf edges in the geodesic calculations, the updated version omits them from consideration. A new 
option flag allows for restoration of the original functionality. 


3 Results 

The updated version of the kdetrees software package is available both from CRAN (the official R 


package system), as well as from the official development repository on Github. (github.com/grady/kdetrees). 


3.1 Simulations 

A set of simulated datasets were constructed and analyzed, using a similar design as the first 
simulation described in Weyenberg et al. |2014|. The simulations measure the true and false positive 
rates for identification of known outlier trees within a set of trees drawn from a common distribution. 
Results of the simulation are summarized as ROC curves, and are presented in Figure [l] 


3.2 Apicomplexa 


The Apicomplexa data set presented by Kuo et al. 120081 consists of trees reconstructed from 268 
single-copy genes from the following species: Babesia bovis (Bb) |Brayton et al. 2007| (GenBank 
accession numbers AAXT01000001-AAXT01000013), Cryptosporidium parvum (Cp) [Abrahamsen 
from CryptoDB.org [Heiges et al. 2006] , Eimeria tenella (Et) from GeneDB.org [Hertz- 


et al. 2004 


Fowler et al., 2004 , Plasmodium falciparum (Pf) [Gardner et al . 2002] and Plasmodium vivax (Pv) 
from PlasmoDB.org Bahl et al. 2003 , Theileria annulata (Ta) [Pain et al., 2005 from GeneDB.org 


|Hertz-Fowler et al. 2004 , and Toxoplasma gondii (Tg) from Toxo-DB.org Gajria et al. 2008 


A free-living ciliate, Tetrahymena thermophila (Tt) [Eisen et al. 2006 , was used as the outgroup. 
To this set of sequences, we appended the Set8 gene, which has been identified by K ishore et al. 
2013 as a probable case of horizontal gene transfer from a higher eukaryote to an ancestor of the 


Apicomplexa. This is the same data set analyzed as part of the original kdetrees paper, which 
was analyzed again with the updated algorithm. The new set of outlier trees is presented in Table 
[lj The newly identified set of outlier trees are presented in a series of figures at the end of this 
manuscript. The figures are in ascending order by the kdetrees tree score, i.e., the first tree depicted 
is the furthest outlying tree. 
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Figure 1: Receiver operating characteristic (ROC) plots comparing the updated kdetrees al¬ 
gorithm with Phylo-MCOA. ROC plots summarize the true and false positive rates for a binary 
classifier as the tuning parameters are changed. A perfect classifier would be represented as a sin¬ 
gle point at the upper left corner of the plot, while a completely random scheme would follow the 
dotted diagonal lines. Curves are shown from simulated coalescent data, using a variety of effective 
population sizes (N e ). Larger values for N e correspond to more variability in the generated trees, 
and thus a more difficult classification problem. 
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Table 1: Apicomplexa gene sets identified as outliers by the updated kdetrees. Genes which were 
not identified as outliers by the original algorithm are marked with a *. 


No. a 

GenelD'’ 

Functional Anno¬ 
tation 

Analysis 

* 472 

PF14_0059 

hypothetical pro¬ 
tein 

Tree topology inconsistent with phylogeny. Bb and Cp on same branch, 
with Ta distant from sister species Bb. Sequence alignment looks good 
in some regions, but with numerous gaps and other regions with poor 
alignment. Multiple homopolymer stretches in Pv and Pf. 

* 478 

PF14_0326 

hypothetical pro¬ 
tein 

Tree topology not consistent with phylogeny of the species. Bb 
branches with the outgroup Tt instead of its closely-related sister 
species Ta. Poor alignment with numerous gaps, numerous homopoly¬ 
mer stretches, particularly in Et. 

488 

PF08-0086 

RNA-binding 
protein, putative 

Significant sequence length disparity (164 a.a. for Ta vs 1075a.a. for 
Pf). Generally good sequence alignment in one region of 100 residues; 
otherwise, alignment is poor. 

* 505 

PF14_0143 

protein kinase, 
putative 

Ta/Bb and Pf/Pv not monophyletic; split by outgroup Tt. Good 
sequence alignment in multiple blocks, but significant sequence length 
differences. Pf/Pv have multiple insertions and Et and Cp sequences 
are truncated. 

515 

PFA0390w 

DNA repair ex¬ 
onuclease, puta¬ 
tive 

Short sequences for Et and Cp. Several homopolymer stretches in Et. 
Modest to good alignment in multiple blocks, Et being an exception 
in several regions. Possible incorrect annotation of Et sequence. 

* 553 

PFC0730w 

conserved pro¬ 

tein, putative 

Tree topology inconsistent with phylogeny. Bb and Ta are distant 
not monophyletic with Pv/Pf. Short regions exhibiting good sequence 
alignment. Et sequence is truncated. 

* 578 

PF14_0042 

U3 small nu¬ 
cleolar ribonu- 

cleoprotein, U3 
snoRNP putative 

Tree topology very inconsistent with phylogeny; Tg branch with out¬ 
group Tt, Et branch with Bb and Ta. Poor alignment. Significant 
sequence length differences; Tg sequence is 4126 residues in length, Cp 
and Tt 2000 residues, Pf and Pv are 450 residues. 

* 585 

PF10_0054 

hypothetical pro¬ 
tein 

Cp exhibits anomalous placement in tree. Significant sequence length 
differences; Pf, Pv, Tg about 1100 residues, Et only 349 residues, so 
numerous gaps in alignment. Some regions show good alignment. 

* 588 

PFI1020c 

Inosine-5’- 

monophosphate 

dehydrogenase 

Sequence alignment looks reasonably good. Tree shows Cp branching 
with outgroup Tt and distant from other Api species. Bb split from 
Ta. 

* 630 

PFL2120w 

hypothetical pro¬ 
tein, conserved 

Tree topology inconsistent with phylogeny. Cp branching with Pv/Pf. 
Ta/Bb not monophyletic with Pv/Pf. Several blocks of sequence show¬ 
ing good alignment, but numerous gaps, due mostly to Ta insertions. 

641 

PFE0750c 

hypothetical pro¬ 
tein, conserved 

Et on a very long branch with other species tightly clustered. Large 
difference in sequence lengths; 269 residues for Et vs. 848 for Pf. Cen¬ 
tral region with modest to good alignment; Et exhibited poor sequence 
identity. 

* 645 

PF14_0635 

RNA binding 

protein, putative 

Tree topology looks proper, although Pv and Pf are on a somewhat 
long branch. Modest to good alignment. 

* 662 

PF11_0463 

coat protein, 

gamma subunit, 
putative 

Multiple homopolymer stretches in Et sequence. Generally good align¬ 
ment for all but Et; sequence might not be homologous. 

* 725 

PF14.0428 

histidine - tRNA 
ligase 

Tree topology appears proper, but Pf/Pv on long branch. Good align¬ 
ment in two large blocks, but significant gaps and poor alignment in 
other regions. Et sequence truncated (339 vs. 1000 residues for oth¬ 
ers). Ta sequence also truncated (583 residues). 

* 745 

PF1UXM9 

hypothetical pro¬ 
tein, conserved 

Ta and Bb branch is distant from other Api species, which cluster 
tightly. Regions of good sequence alignment by with several large 
gaps. Sequence length differences; Pf and Pv = 3300 residues, Tg and 
Cp = 2600, Et only 347. 

* 750 

PFE1050w 

adenosylhomocysteirfEseand Bb somewhat distant from other Api species. Relatively good 
(S-adenosyl-L- sequence alignment, although Et sequence truncated (291 vs. 480 for 

homocystein e others), 

hydrolase) g 


2008 


“Based on geneset designations in Kuo et al. 
b Geneset represented by GenelD for Plasmodium falciparum. 








4 Discussion 


4.1 Apicomplexa 

The outliers identified by kdetrees in the Apicomplexa dataset are substantially different than 
those reported in our original paper. In the original paper, many of the outliers were trees containing 
a single edge much longer than any other edge. This was largely attributable to one or more poorly 
aligned sequences within the larger multiple sequence alignment. Thus, the disproportionate edges 
were often leaf edges, and as a result of the changes in the algorithm, the leaf edges are no longer 
taken into consideration by the default settings. As a result, many of the trees identified as outliers 
in the original paper are no longer identified as outliers by the default parameter settings of the 
improved version. Of course, there is a new option flag available, which can be used to restore the 
behavior found in the original paper. The software also retains the “dissimilarity mode” from the 
original implementation, which always uses the terminal branch length information. 

The cumulative result of the changes in the algorithm is an increased focus on differences in 
topology in the dataset. The new set of 16 outlying trees differ from the non-outliers primarily 
in the placement of the Bb, Cp, Ta, and Tt genes within the trees. In the non-outlier trees, Cp 
generally forms a clade with Tt, while Ta forms a clade with Bb. In the outlier trees, however, 
these taxa are placed in widely varying locations within the trees, as demonstrated by the tree 
drawings of outlier trees found at the end of this manuscript. 

Together, these results demonstrate that the updated kdetrees algorithm is more sensitive 
to topological differences in the trees than the previous version, at the expense of the loss of 
information from the terminal edge lengths. However, the original functionality using the terminal 
edge information is still available for use, by setting the appropriate flags. 


4.2 Simulations 


The performance of the classifier with the simulated datasets is substantially better than the original 
version of the algorithm. Although there is a modest performance penalty associated with the 
estimation of the normalizing constants, kdetrees remains significantly faster than the competitor 
Phylo-MCOA algorithm |de Vienne et al., 2012 


When the non-outlier trees are drawn from a single coalescent distribution, the performance 
of the classifier is nearly perfect, identifying the correct outlier in every simulation iteration, even 
when the variance of the coalescent distributions (controlled by the effective population size param¬ 
eter N e ) was quite large. (Of course, due to the nature of the classifier, false positives are inevitable 
if the tuning parameter is chosen poorly.) In the more difficult cases where the non-outliers were 
drawn from a mixture of coalescent distributions, the updated algorithm remained superior to the 
Phylo-MCOA algorithm, showing greater area under the ROC curve for all cases except for the 
case of the most highly variable 5-part mixture distribution for the non-outlier trees. 


5 Conclusion 

Our proposed method is motivated by the fact that existing methods of phylogenetic analysis and 
tree comparisons are not adequate for genomic scale phylogenetic analysis, particularly in cases 
of certain non-canonical evolutionary phenomena. Furthermore, the scenario in our mixed coales¬ 
cent distribution simulation—where the non-outlier trees are sampled from an unknown mixture 
of distributions—cannot be handled by parametric methods, with the possible exception of the 
genome spectral methods. However, even the genome spectral methods ignore possible statistical 
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dependencies between different feature spectra. In contrast, we propose analyzing a collection of 
gene trees without reducing gene trees to summarizing information. Our kdetrees approach also 
possesses a considerable advantage in speed over other methods, which is of paramount importance 
for a tool used in whole-genonre phylogenetic analysis. 

In addition, one of the applications of our method is an inference on the species tree or a tree 
that reflects the evolution of most genes in the genome. We can use our method to identify genes 
which produce discordant trees (outlying trees) and then we can remove them from phylogenetic 
analysis. By doing this we can use the genes that share the same evolutionary history and we can 
build a tree that reflects the evolution of the species or that of most of the genome. 

We have been interested in developing a phylogenomic pipeline that is convenient and accessible, 
as well as robust. To accomplish this aim, an important problem that needs attention is to com¬ 
paring thousands of gene phylogenies across whole genomes. Thus, our approach is focused on the 
efficiency of the algorithm in terms of computational complexity and memory requirements, with 
less emphasis on achieving the highest classification accuracy possible. Thus, it is very important for 
us to improve the accuracy of the method while we maintain the speed of the algorithm. Compared 
with the original kdetrees and the Phylo-MCOA algorithm, in this paper, we have demonstrated 
the significant improvement in the accuracy of the method without costing the computational speed 
(see the simulation results in Figure [l]). 

In future work we intend to apply kdetrees to clustering trees based on similarity. Unsupervised 
clustering is an important method to learn the structure of unlabeled data. The aim of clustering 
methods is to group patterns on the basis of a similarity (or dissimilarity) criteria where groups 
(or clusters) are set of similar patterns. 

Many traditional clustering algorithms (e.g., K-Means, Fuzzy c-Means, SOM and Neural Gas) 
have the version of kernel-based algorithms |Hur et al\ |2000[ |2001[ [Camastra and Verrij |2005| . The 
use of kernels allows to map implicitly data into an high dimensional space, called feature space; 
computing a linear partitioning in this feature space results in a nonlinear separation between 
clusters in the input space. Using kernels defined in the space of trees, mapped into vector spaces, 
to evaluate trees’ structural similarity allows us to use Kernel-based clustering methods to group 
trees in the space. 
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